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Abstract 

Also known as likelihood-free methods, approximate Bayesian computational (ABC) meth- 
ods have appeared in the past ten years as the most satisfactory approach to intractable like- 
lihood problems, first in genetics then in a broader spectrum of applications. However, these 
methods suffer to some degree from calibration difficulties that make them rather volatile in 
their implementation and thus render them suspicious to the users of more traditional Monte 
Carlo methods. In this survey, we study the various improvements and extensions brought on 
the original ABC algorithm in recent years. 
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1 Introduction 



Conducting a Bayesian analysis in situations where the UkeUhood function i{d\y) is not available 
raises a computational issue. The likelihood may be unavailable for mathematical reasons (it is 
not available in closed from as a function of 0) or for computational reasons (it is too expensive 
too calculate). 

In some specific settings, the likelihood is expressed as an intractable multidimensional inte- 
gral 

i{e\y) = y"r(0|y,u)du, 

where y G C M" is observed, u G M'' a latent vector and 6 G M'^' the parameter of interest. For 
instance, when facing coalecent models in population genetics (see, e.g. Tavare et"al| 19971, y is 
the genotypes of the present sample, while u stands for their genealogical tree and the genotypes 
of their ancestors. In the particular set-up of hierarchical models with partly conjugate priors, it 
may be that the corresponding conditional distributions can be simulated and this property leads to 



a Gibbs sampler (Gelfand and Smith 1990 1. Such a decomposition is not available in general and 



there is no generic way to implement an MCMC algorithm like the Metropolis-Hastings algorithm 
(see, e.g., Robert and Casella] 2004 ; Marin and Robert 2007[ ). Typically, the increase in dimension 
induced by the data augmentation from to u may be such that the convergence properties of the 
corresponding MCMC algorithms are too poor for the algorithm to be considered. 
In others situations, the normalizing constant of the likelihood Zq is unknown 

my) = ii{o\y)/Z0. 

This is typically the case of Gibbs random fields used to model the dependency within spatially 



correlated data, with applications in epidemiology and image analysis, among others (e.g. Rue 
and Held ( 2005| l). For such models, a solution relying on the simulation of pseudo- samples has 
been proposed by |M0ller et al (2006 1. However the dependency of this solution on a pseudo-target 
distribution makes it difficult to calibrate ( Cucala et~all 2009} Friel and Pettitt 2008) in general 
settings. 

Bayesian inference thus faces a large class of settings where the likelihood function is not com- 
pletely known, e.g. £{d\y) = £i{d\y)£2{d) with £2 unknown, and where exact simulation from the 
corresponding posterior distribution is impractical or even impossible. Such settings call for prac- 



tical if cruder approximations methods. In the past, Laplace approximations (Tierney and Kadane 



I986| ) and variational Bayes solutions ( [Jaakkola and Jordanj |2000| ) have been advanced for such 
problems. However, Laplace approximations require some analytic knowledge of the posterior 
distribution, while variational Bayes solutions replace the true model with another pseudo-model 
which is usually much simpler and thus misses some of the features of the original model. 

The ABC methodology, where ABC stands for approximate Bayesian computation, was men- 
tioned as early as 1984 through a pedagogical and philosophical argument in |Rubin|(|1984|). It 



offers an almost automated resolution of the difficulty with models which are intractable but can 
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be simulated from. It was first proposed in population genetics by Tavare et al ( 1997| ), who in- 
troduced approximate Bayesian computational methods as a rejection technique bypassing the 
computation of the likelihood function via a simulation from the corresponding distribution. The 



exact version of the method cannot be implemented but in a very small range of cases. Pritchard 



et al ( 1999| ) produce a generalisation based on an approximation of the target. We study here the 



foundations as well as the implementation of the ABC method, with illustrations from time series. 

This survey describes the genesis of the ABC approach and its justifications (Section |2]l, the 
calibration of the method (Section[3]l, recent sequential improvements (Section|4]l, post-processing 
of ABC outputs (Section|5]l, and the specific application of ABC to model choice (Section[6]). The 
illustrations of the ABC methodology are based on the posteriors of the MA(2) and MA( 1 ) models 
for which the true posterior distribution can be computed; the impact of the ABC approximation 
can thus be assessed. We do not cover the increasingly wide array of applications of ABC here 
here; see'Csillery et al (2010a) for a survey of implementations of ABC in genomics and ecology. 
Neither do we address the controversy raised by Templeton (2008 2010 ) about the lack of validity 



of the ABC approach in statistical testing. Answers to those criticisms are provided in Beaumont 
etall([20T0l);|Csillery et all(|2010b|);perger et al|(|20T0l), among others. 



2 Genesis of the ABC approach and justifications 



Prehistory Rubin] ( [T984 ) advances a visionary statement that 'Bayesian statistics and Monte 
Carlo methods are ideally suited to the task of passing many models over one dataset' . Further- 



more, he produces in this paper a description of the first ABC algorithm. Followed by Tavare 



et al ( 1997j), the original ABC algorithm is in fact a special case of an accept-reject method (see, 
e.g., Robert and Casella 2004), where the parameter 6 is generated from the prior 7l{6) and the 



acceptance is conditional on the corresponding simulation of a sample being 'almost' identical to 
the (true) observed sample, which is denoted y throughout this paper. For the original algorithm 
given below (and solely for this algorithm), we suppose that y takes values in a finite or countable 
set 



Algorithm 1 Likelihood-free rejection sampler 1 
for / = 1 to do 
repeat 

Generate 6' from the prior distribution 7i{-) 
Generate z from the likelihood f{-\6') 
until z = y 

set di = e', 

end for 
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It is straightforward to show that the outcome 62, • • • , On) resulting from this algorithm 
is an iid sample from the posterior distribution since 



°c7r(0,-|y). 



Rubin (1984 1 does not promote this simulation method in situations where the likelihood is not 
available but rather exhibits it as an intuitive way to understand posterior distributions from a 
frequentist perspective, because parameters from the posterior are more likely to be those that 
could have generated the observed data. (The issue of the zero probability of the exact equality 
between simulated and observed data in continuous settings is not addressed in the original paper, 
presumably because the very notion of a 'match' between simulated and observed data is not 
precisely defined.) 



The first ABC In a population genetics setting, Pritchard et al ( 1999 1 extend the above algorithm 



to the case of continuous sample spaces, producing the first genuine ABC algorithm, defined as 
follows 

Algorithm 2 Likelihood-free rejection sampler 2 
for / = 1 to do 
repeat 

Generate 6' from the prior distribution 7i{-) 
Generate z from the likelihood f{-\6') 
until p{T7(z), 77 (y)} <£ 
set di = B', 
end for 

where the parameters of the algorithm are 

- 77 , a function on & defining a statistic which most often is not sufficient, 

- p > 0, a distance on T] (^), 

- £ > 0, a tolerance level. 

The likelihood-free algorithm above thus samples from the marginal in z of the joint distribu- 



tion 



7r(e)/(z|e)iA.,(z) 

7re 0,z y „(Q^rt iflN . .fl . (1) 



where Ifi(-) denotes the indicator function of the set B and 

Ae,y = {zG^|p{Tj(z),T7(y)}<£}. 



The basic idea behind ABC is that using a representative (enough) summary statistic rj coupled 
with a small (enough) tolerance e should produce a good (enough) approximation to the posterior 
distribution, namely that 

^e{d\y) = J ne{d,z\y)dz^n{d\y). 

Before moving to the extensions of the above algorithm, let us consider a simple dynamic 
example. 



Example The MA{q) process is a stochastic process {yk)keN* defined by 

1 

yk = Uk + Y, ^iUk-i , (2) 



1=1 



where {uk)ke'L is an iid sequence of standard Gaussians ^(0, 1). Even though a Bayesian analysis 
can handle non-identifiable settings and still estimate properly identifiable quantities (see, e.g.. 



Marin and Robert 2007 [ Chapter 5), we will impose a standard identifiability condition on this 



model, namely that the roots of the polynomial 

q 

B{x) = \-Y, Oix' 

i=l 

are all outside the unit circle in the complex plane. A simple prior distribution is therefore the 
uniform distribution over the corresponding range of 0,'s, especially when q is small and the set 
of resulting parameters is easy to describe. In the case processed in the figures below for q = 2, 
we obtain the triangle 

-2 < 01 < 2 , 01 + 02 > -1 , 01 - 02 < 1 . 

Although the prior on is very simple, and despite the Gaussian nature of the random vari- 
ables, the likelihood associated with a series {yk)\<k<n is more complex because of the need to 



integrate out u-q+i,. . . ,m_i,mo. (The easier alternative is to condition on {yk)i<k<q, see Marin 



and Robert 2007 even though the general case can also be handled by MCMC simulations as the 
likelihood is available, at least for small values of n.) 

Running one iteration of ABC in this setting then simply requires 

(a) simulating the MA(^) coefficients uniformly over the acceptable range, 

(b) generating an iid sequence {uk)-q<k<n, 

(c) producing a simulated series {zk)\<k<n- 
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Figure 1: Comparison of the level sets (in black) of the true posterior distribution with the scatter 
plot (in blue) of an ABC sample when using autocovariances as summary statistics. The threshold 
£ is chosen so that 0.1% of the N = 10^ simulated datasets are accepted. The observed dataset 
has been drawn from an MA(2) model with n = 100 epochs and parameter 6 = (0.6, 0.2) (the red 
dot). The triangle is the range of acceptable values of d. 



Depending on the focus of the analysis, the distance can be the raw distance between the series 



P^{{Zk)l<k<n,iyk)l<k<n} = J^{yk-zkf 

k=\ 

or the quadratic distance between summary statistics hke the first q autocovariances 



k=j+\ 

which is our choice for the illustration provided in Figure [T] This experiment shows how an ABC 
sample fits the level sets of the true posterior density for a simulated sample of length 100 using 
the parameters (61, 62) = (0.6,0.2) and a tolerance level equal to the 0.1% quantile of the sample 
of the distances. (The level sets were computed from the exact likelihood for the MA(2) model 
and a grid of values of 6 over the acceptable range.) This plot illustrates how the distribution of 
the sample points departs from true posterior: the approximation does not reconstruct the posterior 
perfectly. Decreasing e would lead to a better concentration of the posterior density on the level 
sets, but at the expense of the size of the resulting sample or at a higher computing cost. < 



MCMC-ABC In practice, using simulations from the prior distribution n{-) is inefficient be- 
cause this does not account for the data at the proposal stage and thus leads to proposed values 
located in low posterior probability regions. As an answer to this problem. Marjoram et al| (2003 1 



introduce an MCMC-ABC algorithm (Algorithm [3]) targeting the approximate posterior distribu- 
tion Tie of equation ([T]). 

Algorithm 3 Likelihood-free MCMC sampler 

Use Algorithm [2] to get a realisation {Q^^\z'^^^) from the ABC target distribution 7re(0,z|y) 
for f = 1 to do 

Generate B' from the Markov kernel q ^- 1 6^'^^^ , 

Generate z' from the likelihood f{-\d'). 
Generate u from ^[o^j, 
^ ^r(0j|(0(^^^[0) 

- 7r(0('-i))^(0'|e('-i)) ' 
set(0W,zW) = (0',z') 
else 

(0W,zW) = (0('-i),z('-')), 
end if 
end for 
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The acceptance probability used in Algorithm [3] does not involve the calculation of the likeli- 
hood and it thus satisfies ABC requirements. It also produces an MCMC algorithm which exactly 
targets ne{d,z\y) as its stationary distribution. Indeed, 

ne{d',z'\y) q{d^'-'^\d')f{zi'-''^\e'^'-'^) 



7re(0('-'),z('-l)|y) g(0'|0('-l))/(z'|0') 

7iie')fiz'\e')iA,„iz') 

7r(e('-^))/(z('-i)|0('-^))lA,,,(z('-i)) 

g(e('-')|eO/(z('-')|e('-i)) 

^ q{d'\d^'-^^) f{z'\d') 

n{d')q{d^ '-'^\d') 



The initialisation of the MCMC sampler with the rejection sampler (Algorithm 2) can be by- 
passed since the Markov chain forgets its initial state. The computational cost of the initialisation 
is then reduced. But then we have to run the MCMC longer to achieve convergence and omit 
the bum-in first iterations from the output, which also has a computational cost. As noted above, 
the ABC approximation depends on tuning parameters (the summary statistic rj, the tolerance £, 
and the distance p) that have to be chosen prior to running the algorithm and the calibration of 
which is discussed in most of the literature. The tolerance e is somewhat the easiest aspect of this 
calibration issue in that, when e goes to zero, the ABC algorithm becomes exact. 



Noisy ABC Wilkinson (2008 1 proposes to switch perspective, replacing the approximation error 
resulting from the loose acceptance condition in the above Ukelihood-free samplers with an exact 
inference from a controlled approximation of the target, essentially a convolution of the regular 
target with an arbitrary kernel function. The corresponding ABC target is thus 



^e(0,z|y) 



KiG)fiz\d)K,iy- 



jK{e)f{z\e)Ke{y-z)dzdd 



(3) 



where Ke is a well-chosen kernel parameterised by the bandwidth e. This perspective is interesting 
in that the outcome is completely controlled, due to the degree of freedom brought by the choice 



of the kernel. Wilkinson (2008 ) makes the valuable point that if the model includes an error term. 



then taking the distribution of that eiTor term to be leads to an ABC algorithm which simulates 



exactly from the error-in- variables posterior. In practice, Wilkinson's ( 2008 1 approach requires a 
modification of the standard ABC algorithms, taking into account the kernel for the simulation 
of z. The new algorithm which includes an accept-reject step imposes an upper bound on the 
convolution kernel K^. 



This perspective of the "noisy ABC" is also adopted by Feamhead and Prangle (20101 who 
study the convergence of ABC based inference. They show that the convolution induced by 
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the kernel representation leads to the true parameter being the maximum of the integrated log- 
likelihood and thus that a Bayes estimator is converging to the true value when the number of 
observations goes to infinity and the tolerance level goes to zero. They also stress the connection 



with the econometrics approach of indirect inference Gourieroux et al ( 1993 1. 



ABC Filtering Jasra et al (201 1 1 propose an ABC scheme for filtering when the distribution of 



the observables conditioned on the hidden state is not available point-wise, related to the convolu- 
tion particle filter of [Campillo and Rossi (2009 1. It is particularly appealing in that it allows com- 
plex (hence realistic) statistical models for filtering. Theoretical arguments are given to prove that 
the ABC approximation of the filter does not accumulate errors along the sequence of observables, 



when the model has good mixing properties. Dean et al (2011 1 illustrate this implementation in the 
specific case of hidden Markov (HMM) models, relating the ABC implementation with Wilkin- 
son's ( |2008 1 perspective and demonstrating that the pseudo (or noisy) model for which ABC is 
exact also is an HMM. Using this representation, they further establish ABC consistency. While 



Dean et al (2011 1 establish that ABC leads to an asymptotic bias for a fixed value of the toler- 
ance £, they also prove that an arbitrary accuracy can be attained with enough data and a small 
enough e. (We note that the restriction to summary statistics that preserve the HMM structure 
is paramount for the results in the paper to apply, hence preventing the use of truly summarising 
statistics that would not grow in dimension with the size of the HMM series.) The convergence 
result central to Dean et al ( |2011 1 is also connected with Feamhead and Prangle's (2010) version, 
mentioned above, in that they both rely on pseudo-likelihood consistency arguments. 



3 Calibration of ABC 

Summary statistics Several authors have considered the fundamental difficulty associated with 
the choice of the summary statistic, T](y), which one would like to consider as a quasi-sufficient 



statistic. First, for most real problems (a notable exception being found in Grelaud et al 2009 
in the case of Gibbs random fields), it is impossible to find non-trivial sufficient statistics which 
would eliminate the need of a choice of statistics. Second, the summary statistics of interest are 
usually determined by the problem at hand and chosen by the experimenters in the field. 

Assuming a large collection of summary statistics is available, Joyce and Marjoram] (2008 1 



consider the sequential inclusion of those statistics into the ABC target. The inclusion of a new 
statistic within the set of summary statistics is assessed in terms of a likelihood ratio test, without 
taking into account the sequential nature of the tests. We have reservations about the method, 
first and foremost that the construction of the statistics is not discussed, while the method is not 
independent from parametrisation, and also that the order in which the statistics are considered 
is paramount for their inclusion/exclusion. A regularisation of the method proposed at the end of 
the paper is to use a forward-backward selection mechanism to address this last issue. However, 
this correction does not address another issue, namely the impact of the correlation between the 
summary statistics. Note at last that Joyce and Marjoram's (2008) method still depends on an 
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approximation factor that needs to be calibrated prior to running the algorithm. In his thesis, 
Ratmann|(|20091) proposes a similar examination of the successive inclusion of various statistics. 



A related perspective is that of McKinley et al ( 2009| ). They perform a simulation experi- 



ment comparing ABC-MCMC and ABC-SMC (discussed below) with regular data augmentation 
MCMC. The authors test strategies to select the tolerance level, and to choose the distance p and 
the summary statistics. The conclusions are not very surprising, in that 

(a) repeating simulations of the data points given one simulated parameter does not seem to 
contribute to an improved approximation of the posterior by the ABC sample, 

(b) the tolerance level does not seem to have a strong influence, 

(c) the choice of the distance, of the summary statistics and of the calibration factors are 
paramount to the success of the approximation, and 

(d) ABC-SMC outperforms ABC-MCMC (MCMC remaining the reference). 



Feamhead and Prangle ( 2010[ | study the selection of summary statistics with the interesting 



perspective that ABC is then considered from a purely inferential viewpoint and calibrated for 
estimation purposes. (This contrasts with most alternative perspectives that envision ABC as 



a poor man's non-parametric estimation of the posterior distribution.) Fearnhead and Prangle 
( j2010) rely on a randomised version of the summary statistics from which they derive a well- 
calibrated version of ABC, i.e. an algorithm that gives proper predictions of given quantities. The 
authors consider choices of summary statistics, and establish that the posterior expectations of the 
parameters of interest are optimal summary statistics, although this follows from their choice of 
loss function. 

Tolerance threshold and ABC approximation error As noted above, the choice of the toler- 
ance level £ is mostly a matter of computational power: smaller e's are associated with higher 
computational costs and the standard practice ( jBeaumont et al[ 2002| ) is to select £ as a small 
percentile of the simulated distances 

p{T](z),T](y)}. 

An alternative described below is to set the ABC algorithm within the non-parametric setting of 
density estimation, in which case £ is understood as a bandwidth and can be derived from the 



simulated population. As noted in Fearnhead and Prangle (20101, this perspective implies that the 



optimal £ is then different from zero. 

Standing rather apart from other contributions to the field, Ratmann et al ( 2009| ) provide an 



intrinsically novel way of looking at the ABC approximation error (and hence at the tolerance). It 
is presented as a tool assessing the goodness of fit of a given model. The fundamental idea there 
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is to use the tolerance e as an additional parameter of the model, simulating from a joint posterior 
distribution 

f{e,e\y)oc^(e\y,d)Ke{d)K,{e), 

where ^{e\y,d) plays the role of the likelihood, and tiq and TTg are the corresponding priors on 6 
and £. In this approach, ^ (£|y, 6) is the prior predictive density of pjij (z), T] (y)} given 6 and y 



when z is distributed from f{z\6). We note here a connection with Wilkinson's (2008 1 target Q 
in that n{d)f{z\d)Ke{y — z) is identical to the above once we replace y — z by £. 

[Ratmann et al^ ( ^2009^ then derive an ABC algorithm they call ABC^ to simulate an MCMC 
chain targeting this joint distribution, replacing ^ (£|y, 6) with a non-parametric kernel approxi- 
mation. For each model under comparison, the marginal posterior distribution on the error e is 
then used to assess the fit of the model, the logic being that this posterior should include in a 
reasonable credible interval. While the authors stress they use the data once, they also define the 
above target by using simultaneously a prior distribution on a and a conditional distribution on 
the same e that they interpret as the likelihood in {£,6). The product is most often defined as a 
density in (e, 0), so it can be simulated from, but the Bayesian interpretation of the outcome is 
delicate, especially because it seems the prior on e contributes significantly to the final assessment 
of the model. As discussed in |Robert et al| ( |20f0l ), some of the choices of |Ratmann et al| ( |2009| ) can 



be argued about, in particular the ambivalent role of the approximation error. The most important 
aspect of the paper is that the original motivation of running ABC for conducting inference on the 
parameters of a model is replaced by the alternative goal of running ABC for assessing a model; 
see Ratmann et al's 2010 reply to the remarks made by [Robert et al (20101. . 



Example Returning to the MA(2) model, we study the impact of the choice of the distance and 
of the tolerance on the approximation. In this example, we simulated a sample of size 50 from a 
MA(2) model based on the same pai^ameters as above. First, we compare the impact of using the 
raw distance between the complete datasets instead of the distance between the autocovariances 
(introduced above). Figure |2] shows that the raw distance between the observed and the simulated 
time series is inefficient and fairly non-discriminative. For the raw distance, the spread of the 
parameters accepted after the ABC step is indeed much wider than for the second distance, espe- 
cially when compared with the level sets of the posterior density. We thus use only the distance 
between the autocovariances in the remainder of the paper. 

We now turn to the tolerance e. Figure [3] shows that decreasing e along empirical quantiles of 
the simulated distances p(T](z),T](y)) improves the approximation, although we never reach the 
true marginal densities (this is particularly true for the parameter 62-) The marginal densities of 
the ABC samples were obtained by the R default density estimator and the true marginal densities 
by numerical integration. -4 
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Figure 2: Scattering of two ABC samples when the computations are based on the autocovariance 
distance (left) and the raw distance (right), using different quantiles on the simulated distance for 
e(l%in blue, l%o in red, and 0.1 %o in yellow). The level sets of the posterior density are exhibited 
in black. 
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Figure 3: Evolution of the distribution of ABC samples using different quantiles for e (10% in 
blue, 1% in red, and 0.1% in yellow) when compared with the true marginal densities. The dataset 
is the same as in Figure [2] 



4 Sequential improvements 

Importance sampling Sequential techniques can enhance the efficiency of the ABC algorithm 
by learning about the target distribution, as in'Sisson et aFs (2007) partial rejection control (PRC) 
version. The ABC-PRC modification introduced by ,Sisson et al ( p007 1 consists in producing sam- 



ples (0 j'\ . . . , at each iteration 1 < ? < T of the algorithm by using a particle filter method- 
ology. Starting with a regular ABC step, the generation of the Q 
kernels Kt, 

■K,{6\e 



s relies on Markov transition 



0« ~ 



until z ~ /(z| •*) is such that p{'r] (z), rj (y)) < e, where Q* is selected at random among the pre- 

The probability (of^ is derived by an importance sampling 

7r(eP)L,_i(r|eP) 

' ~ 7r(r)^,(0f' 10'') 

where Lf_i is an arbitrary transition kernel. While this method is based upon the theoretical work 
of Del Moral et al ( 2006| ) and their SMC sampler, the application to approximate Bayesian compu- 
tation results in a bias in the approximation to the posterior, because the likelihood is removed in 
a standard ABC fashion ( |Sisson et al 2009| ). Replacing the likelihood with the indicator function 



vious '''s with probabilities OTj-'^ ^\ 
argument as 



(0. 
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provides an unbiased estimator of the likelihood that cannot be used as such in the denominator of 
a Metropolis-Hastings acceptance probability, hence the resulting bias. 

An alternative version called ABC-PMC and based on genuine importance sampling argu- 
ments, proposed by Beaumont et al ( 2009| l, bypasses this difficulty, in connection with the popu- 
lation Monte Carlo method of Douc et al ( |2007| ). It includes an automatic scaling of the forward 
kernel. The correction published in Sisson et al (2009) acknowledges the existence of a bias and 
suggests a correction essentially identical to the PMC solution of Beaumont et al (2009 1. 

As illustrated in the pseudo-code below, ABC-PMC constructs a kernel approximation to the 
target distribution based on earlier simulations and estimates the random walk scale (which is also 
the kernel bandwidth) from those simulations, using in addition a decreasing sequence of tolerance 
thresholds £\> . . ■>£t'- 



Algorithm 4 Likelihood-free population Monte Carlo sampler 
At iteration t = \, 
for i=\toN Ao 
repeat 

Simulate df^ ~ 7r(0) and z ~ /(z | df^) 
untilp(77(z),77(y)) <£i 
Set w/'' = \/N 
end for 

Take Zi as twice the empirical variance of the ©['^'s 
for f = 2 to r do 
for / = 1 to do 
repeat 

Pick 0* from the ^^'^^'s with probabihties coj^^^ 

Generate df ~ .J^{d1,Z,^i) and z ~ /(z | df) 
until p (t] (z), T] (y)) <£, 

Set c nief)/i!]^, coj'-V { {ef - e;.'-)) } 

end for 

Take as twice the weighted variance of the 0-'''s 
end for 



Another related paper is Toni et al's 2009| proposal of a parallel sequential ABC algorithm. 
Just like ABC-PMC, the ABC-SMC algorithm (an acronym found in several papers) developed 
therein is based on a sequence of simulated samples, Markov transition kernels, and importance 
weights rather than SMC justifications. The unavailable likelihood is estimated by the indicator of 
the tolerance zone or an average of indicators as in 'Marjoram et al ( 2003) ). The bulk of the paper is 
dedicated to the analysis of ODEs, using uniform distributions as transition kernels. The adaptivity 
of the ABC-SMC algorithm is restricted to a progressive reduction of the tolerance, since the 
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kernels Kf's remain the same across iterations, in contrast with the ABC-PMC motivation for 
tuning the Kf's to the target. The paper also contains a comparison with ABC-PRC, which shows 



a bias in the variance of the ABC-PRC output, in line with Beaumont et al (2009 1. 

[McKinley et al| ( [2009] ) have coded the parallel sequential ABC algorithm on an infectious dis- 
ease model (a recent outbreak of Ebola Haemorrhagic Fever in the Democratic Republic of the 
Congo — for which there is no known treatment and which is responsible for an 88% decline 
in observed chimpanzee populations since 2003!). They show that the ABC-SMC sampler out- 
performs ABC-MCMC (MCMC remaining the reference). The comparison experiment is based 
on a single dataset, with fixed random walk variances for the MCMC algorithms; note that the 
prior used in the simulation might be too highly peaked around the true value (gamma rates of 
0.1). Some of the ABC scenarios do produce estimates that are rather far away from the refer- 
ences given by MCMC, for instance CABC-MCMC when the threshold e is 10 and the number of 
repeats R is 100. 



Backward kernels and SMC Del Moral et al ( 2009 1 exhibit the connection between the ABC 



algorithm and the foundational SMC paper of Del Moral et al ( 2006 1 that inspired Sisson et al 



(2007 1. As opposed to the latter, and despite a common framework, this ABC-SMC paper properly 



relies on the idea of using backward kernels L, to simplify the importance weights and to remove 
from these weights the dependence on the unknown likelihood. A major assumption of , Del Moral] 
et al ( 2009| ) is that the forward kernels Kt are supposed to be invariant against the true target 
(which is a tempered-like version of the true posterior in sequential Monte Carlo), a choice not 



explicitely made in Sisson et al (2007 1. One of the novelties in the paper is that the authors 
rely on M repeated simulations of the pseudo-data z given the parameter, rather than using a 
single simulation. In that perspective, each simulated parameter gets a non-zero weight that is 
proportional to the number of accepted z's. The limiting case M — oo brings in an exact simulation 
from the tempered targets TTg/s, so there is a convergence principle and the stabilisation of the 
approximation could be assessed to calibrate M. The adaptivity in the ABC-SMC algorithm is 
found in the on-line construction of the thresholds: the thresholds decrease slowly enough to keep 
a large number of accepted transitions from the previous sample. An important feature is that the 
update in the importance weights simplifies to the ratio of the proportions of surviving particles, 
due to the choice of the reversal backward kernels and to the use of invariant transition forward 
kernels Kf. 



In a very related manner, Drovandi and Pettitt (2010 1 use a combination of particles and of 
MCMC moves to adapt a proposal to the true target, with acceptance probability 



min< 1 



K{d*)K{dc\d*) 

^' 7i{d*)K{e*\ec) 

where 6* is the proposed value, dc is the current value (picked at random from the particle popu- 
lation), and K is a. proposal kernel used to simulate the proposed value. The algorithm is adaptive 
in that the previous population of particles is used to make the choice of the proposal K, as well as 
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of the tolerance level The level of novelty of the method compared with 'Del Moral et al (2009 1 
is quite limited, since the paper adapts the tolerance on-line as an a-quantile of the previous par- 
ticle population. The convergence analysis which is omitted by Drovandi and Pettitt] ( |2010 



is 



perhaps not so standard, mainly because the MCMC is appUed only to half of the particle system. 



Del Moral et al (201 1 1 tackle the issue of adaptive resampling strategies. The only strong method- 



ological difference between the two papers is that the MCMC steps are now repeated 'numerous 
times'. However, this partly cancels the appeal of an 0(A'^) order method versus the 0{N^) order 
ABC-PMC and ABC-SMC methods. An interesting remark there is that advances are needed in 
cases when simulating the pseudo-observations is very costly, as in Ising models. However, re- 
placing exact simulation by a few steps from a Gibbs sampler as in Grelaud et al ( 2009 1 cannot be 
very detrimental to the convergence of an approximate algorithm. 



5 Post-processing of ABC output 

Local linear regression Improvements to the general ABC scheme have been achieved by view- 
ing the problem as a conditional density estimation and developing techniques to allow for larger 
£ (Beaumont et al 2002[ ). This is a post-processing scheme in that the simulation process per se 



does not change but the analysis of the ABC output does. The authors endeavour to include all 
simulated summary statistics, even those far away from the observed summary statistic, by shrink- 
ing the corresponding parameters in a linear manner. More specifically, they replace the simulated 
0's with 

0* = 0-{T7(z)-r7(y)}T^, 

where j8 is obtained by a weighted least squares regression of Q on (t](z) — f]{y)), using weights 
of the form 

Ks{p{^{z)My)]] , 

where Kg is a non-parametric kernel with bandwidth 8. 



Example We implement this correction of Beaumont et al| ( 2002 ) in the MA(2) model, again 



using the first two autocovariances as summary statistic 77 (z), and we apply a non-parametric local 
regression based on the Epanechnikov kernel. We keep 8 equal to the value of the tolerance e used 
in the regular ABC scheme. Figures |4] and [5] summarise the results. When using a 0.1% quantile, 
the two density estimates are identical in the case of the parameter 62- The post-processed density 
estimate of Q\ is closer to the true posterior. When using a 20% quantile, the impact of the local 
regression is more spectacular. We recover results close to those obtained with the 0.1% quantile. 
This exhibits the point that local regression strongly attenuates the impact of the truncation brought 
by £. < 
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Figure 4: Comparison of the density estimates of the distributions of the parameters using an 
ABC approximation with e as the 0.1% quantile on the autocovariance distances (in blue) and 



the [Beaumont et al| ( |2002| l correction ( in red). The red and blue curves are confounded for the 
parameter 02- 
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Figure 5: Comparison of the approximate distributions of the parameters using an ABC approxi- 



mation with e as the 20% quantile on the autocovariance distance ( in blue) and the Beaumont et al 



( 2002 1 correction (in red). 



Nonlinear regression Blum and Francois (2010 1 propose a generalisation of Beaumont et al s 



(2002) ABC post-processing where the local linear regression of the parameter 6 on the summary 
statistics rj (z) is replaced by a nonlinear regression with heteroskedasticity. In this new approach, 
the nonlinear mean and variance are estimated by a neural net with one hidden layer, using the R 
package nnet ( [R Development Core Team 2006| ). The result is interesting in that it seems to allow 
for the inclusion of more or even all the simulated pairs (0,z), compared with Beaumont et al. 
(2002). This is somehow to be expected since the nonlinear fit adapts differently to different parts 
of the space. Therefore, weighting simulated (0 , z) 's by a kernel Ks (z — y) is not very relevant and 
it is thus not surprising that the bandwith 5 is not influential, in contrast with basic ABC and even 
Beaumont et al. (2002) where 5 has a different meaning. The non-parametric perspective adopted 
in the paper is nonetheless of the highest importance, as it proves the most fruitful approach 
to the interpretation of ABC methods. In connection with this paper, Blum| ( 2010[ ) provides a 
good review of the non-parametric handling of ABC techniques. The true difficulty with the non- 
parametric perspective lies with the curse of dimensionality. This issue might be addressed by 
mixing dimension reduction with recycling by shrinking as in Beaumont et al. (2002). 



Inverse regression Leuenberger et al ( |2010 1 also relate to the local regression ideas in Beaumont 



et al. (2002). As in the earUer work by |Wilkinson| ( [2008 ), the approximation to the distribution 
of the parameters given the observed summary statistics is central to the paper. In opposition to 
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Beaumont et al (2002 1, there is no clear shrinkage for summary statistics that are far away from the 



observed summary statistics: all accepted parameters are weighted similarly in the Gaussian linear 
approximation to the truncated prior. The other difference with Beaumont et al ( 2002 1 is that the 
authors model z given 6 rather than 6 given z, in an inverse regression perspective, followed by a 
sort of Laplace approximation reminding Rue et al (2009 1. 



6 ABC and model choice 
6.1 Bayesian model choice 

Model choice is one particular aspect of Bayesian analysis that involves computational complex 
ity, if only because several models are considered simultaneously (see, e.g., |Robe7tl |2001t [Marin 



and Robertl 2010 1. In addition to the parameters of each model, the inference considers the model 
index which is associated with its own prior distribution 7r(./# = m) (m = 1, . . . ,M) as well 
as a prior distribution on the parameters conditional on the value m of the model index, 7im{0m), 
defined on the parameter space ©„,. The choice between these models is then driven by the pos- 
terior distribution of a challenging computational target where ABC brings a straightforward 
solution. Indeed, once ^ is incorporated within the parameters, the ABC approximation to the 
posterior follows from the same principles as regular ABC, as shown by the following pseudo- 
code, where Tj (z) = (t]i (z) , . . . , tjm(z)) is the concatenation of the summary statistics used for all 
models (with elimination of duplicates). 



Algorithm 5 Likelihood-free model choice sampler (ABC-MC) 
for / = 1 to do 
repeat 

Generate m from the prior = m) 
Generate 0„, from the prior 7im{dni) 
Generate z from the model /m(z|0m) 
until p{T] (z), T] (y)} <£ 
Set m(') =mand 0^ = 
end for 



The ABC estimate of the posterior probability 7i{^ 
from model m, namely 

1 ^ 



m|y) is then the acceptance frequency 



This also corresponds to the proportion of simulated datasets that are closer to the data y than 
the tolerance e. Comuet et al ( |2008 1 follow the rationale that led to the local linear regression 
in Beaumont et al ( 2002 1 and rely on a weighted polychotomous logistic regression to estimate 
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= m\y). This modeling clearly brings some further stability to the above estimate of 7i{^ = 
m\y) and is implemented in the DIYABC software described in Comuet et al (2008 1 . 



Example Returning once again to our benchmark MA (2) model, we compare the computation 
of the model posterior probabilities based on an ABC sample (acceptance frequency within each 
model) with the true value of the Bayes factor, which was obtained by numerical integration. The 
dataset used in the experiment is a time-series simulated and we wish to choose between two 
models: an MA(2) or an MA(1) model. Figure |6] shows our estimates for data simulated from ar 
MA(2) model. The weight of the MA(2) model increases slightly as e decreases. However, even 
for the quantile at 0.01% the estimated posterior probability for the MA(2) model is equal to 0.72 
which is far from the true value 0.95. Figure [7] shows a similar phenomenon for data simulated 
from an MA(1) model. < 




Figure 6: Boxplots of the evolution [against £] of ABC approximations to the Bayes fac- 
tor. The representation is made in terms of frequencies of visits to [accepted proposals from] 
models MA(1) (left) and MA(2) (right) during an ABC simulation when e corresponds to the 
10, 1,0.1,0.01% quantiles on the simulated autocovariance distances. The data are the same as in 
Figure[5] The true Bayes factor B21 is equal to 17.71, corresponding to posterior probabilities of 
0.05 and 0.95 for the MA(1) and MA(2) models respectively. 



The discrepancy in the above example shows the limitations of the ABC approximation of 
Bayes factors exposed in Robert et al (2011|. While we could expect to obtain a better approxi- 
mation with a massive computational effort, it may be that the use of different summary statistics 
for different models prevents us from converging to the true value. In other words, the concatena- 
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Figure 7: Boxplots of evolution of Bayes factor approximations in terms of frequencies of visits to 
models MA(1) (left) andMA(2) (right) using an ABC approximation with 10, 1, .1, .01% quantiles 
on the autocovariance distance as e. The dataset is a sample of 50 points from a MA(1) model 
with d\ = 0.6. The true Bayes factor B21 is equal to .004 corresponding to posterior probabilities 
of 0.996 and 0.004 for the MA(1) and MA(2) models respectively. 



tion of sufficient statistics for individual models does not always constitute a sufficient statistic for 
model choice, as discussed in the next paragraph. 

6.2 The case of Gibbs random fields 



Grelaud et al ( 2009| ) show that, for Gibbs random fields and in particular for Potts models, where 



the goal is to compare several neighbourhood structures, the computation of the posterior proba- 
bilities of the models under competition can be operated by likelihood-free simulation techniques. 
We recall first that Gibbs random fields are probabilistic models associated with the likelihood 
function 

^(0|y) = ^exp{0TT7(y)}, 
Ze 

where y is a vector of dimension n taking values over a finite set ^ (possibly a lattice), T](-) is 
the potential function defining the random field, taking values in W, 6 G M'' is the associated 
parameter, and Zq is the corresponding normalising constant. A special but important case of 
Gibbs random fields is associated with a neighbourhood structure denoted by / ~ /' (meaning that 
/ and /' are neighbours), in that 

^(y) = L%=.v,v}' 
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where I^,/^, indicates that the summation is over all the pairs of neighbours. In that case, is a 
scalar. 

The central property ensuring an ABC resolution for Gibbs random fields is that, due to their 
exponential family structure, there exists a sufficient statistic vector that runs across models and 
which allows for an exact (e = 0) simulation from the posterior probabilities of the models. Indeed, 
model choice involves M Gibbs random fields in competition; each field is associated with a 
potential function Tj^ (1 < "J < M), i.e. with the corresponding likelihood 

imiOm\y) = exp { 0j,T],„(y) } /Ze„„„, , 

where dm G @m and Ze„, ,„ is the unknown normalising constant. From a Bayesian perspective, 
considering an extended parameter space = U^^j{ni} x 0„, that includes the model index 
the computational target is thus the model posterior probability 



n{^ = m\y)oc £,„{d,ri\y)7tmidm)<iOm^ 



■ m] 



i.e. the marginal in ^ of the posterior distribution on 0i, . . . , 0m) given y. Each model 
has its own sufficient statistic Then, for each individual model, the vector of statistics 

T](-) = (r}i(-), . . . , T]m(')) is clearly sufficient. However Grelaud et al (20091 exposed the fact that 
T] is also sufficient for the joint parameter 0i, 



,0 



M) 



That the concatenation of the sufficient statistics of each model is also a sufficient statistic 
for the joint parameter across models is clearly a property that is specific to exponential families. 



As shown by Didelot et al (2011 1, ABC-based model choice can process exponential families 
by creating inter-model sufficient statistics that incorporate the intra-model sufficient statistics as 
well as possibly the dominating measures for all models. The Gibbs random field above is a 
specific case of this sufficiency. However, outside exponential families, the possibility of creating 
a sufficient statistic of a dimension that is much lower than the dimension of the data is impossible. 



as explained in Robert et al (201 1 1. 



6.3 General issues 



Toni et al (2009 1 and Toni and Stumpf] ( |2010 l review ABC-based model choice, inclusive of the 
above Gibbs random field example. The authors study in particular the consequences of imple- 
menting a sequential algorithm like ABC-PMC in this set-up. The ABC algorithm is modified to 
incorporate the model index, resorting to the previous assessment of = m\y) to propose the 
model indices of the next population. The importance sampling features of this setting imply that 
the posterior probability can be estimated from the importance weights. However, the adaptivity at 
the core of ABC-PMC and ABC-SMC implies adapting an approximation kernel for each model. 



As most other perspectives on ABC, Toni and Stumpf (2010 1 do not question the role of the ABC 



distance in model choice settings. The Bayes factors are observed to be sensitive to the choice of 
the prior distributions, of the tolerance levels, and to the variances of the kernels Kf (see Section 
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[4]l, a dependence that should not occur, since this is a simulation parameter that is unrelated with 
the statistical problem. 

It is worth pointing out the remark made by Leuenberger et al (2010) about model choice and 
the use of the approximation of the normalising constant resulting from the modelling to get to 
the marginal likelihood and the computation of the Bayes factor. This relates to earlier comments 
in the literature about the ABC acceptance rate approximating the marginal and a recent paper by 



Bartolucci et al (2006 1 studying ways of computing marginal probabilities by Rao-Blackwellising 



reversible jump acceptance probabilities. Grelaud et al ( |2009[ l also make the most of this ABC 
feature for Ising models, since an exact ABC (corresponding to £ = 0) algorithm is then available 
for model selection. 

A (minor) Bayesian issue mentioned by Ratmann et al ( |2009 1 is the fact that both 6 and e 
are taken to be the same across models. In a classical Bayesian perspective, modulo the reparam- 
eterisation, 6 cannot be entirely different from one model to the next, but using the same prior 
on £ over all models under comparison is more of an issue. The paper also considers the impact 
of testing for the adequacy of a model as testing for the hypothesis Hq : e = 0, an interesting if 
controversial stance, since even when the model fits, £ necessarily varies around zero. 

At this stage, the most perplexing feature of ABC model choice is the lack of convergence 
guarantees. As exposed in [Robert et al| (201 1 ), most settings where ABC model choice is imple- 
mented do not allow for inter-model sufficiency in the selection of the summary statistics, because 
some models are not within exponential families and because using the whole data is too demand- 
ing. As shown by the MA example above, this lack of sufficiency may be quite detrimental to the 
quality of the ABC approximation of the Bayes factors. There is therefore currently no theoretical 
support for the use of ABC approximations of Bayes factors and posterior model probabilities, and 
we thus advise for more empirical assessments in the spirit of Ratmann et al ( 2009[ ) that evaluate 
the model fit within each model without concluding by exact figures of the probabilities of the 
different models. 



7 Discussion 

Approximate Bayesian Computation allows inference from a wide class of models which would 
otherwise be unavailable. As such, it has spawned interest in both theoretical issues and applica- 
tions. Recent advances regarding the calibration of the method lead to an approximation that is 
good enough to be highly useful in many situations. The efficiency of the method can be greatly 
improved with sequential techniques and post-processing regression on the output. 

Nonetheless, ABC is not a silver bullet. In the current state of the art, it can only be used 
for model choice in a limited range of models. Future advances must at the same time expand 
further the tools to make ABC useful in a wider class of models, extend pre- and post-processing 
methods to control the approximation, and establish more clearly in which cases ABC reaches its 
limitations. 
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ABC methods are currently under an intense scrutiny by both statisticians and practitioners, 
hence the object of an unparalleled development. While this rapid development provides answers 
to some interrogations from the statistical community about the validity of the approach and from 
the practitioners about a higher efficiency of the method, some issues remain unsolved, among 
which: 

• the convergence results obtained so far are unpractical in that they require either the tol- 
erance to go to zero or the sample size to go to infinity. Obtaining exact error bounds for 
positive tolerances and finite sample sizes would bring a strong improvement in both the 
implementation of the method and in the assessment of its worth. 

• even though ABC is often presented as a converging method that approximates Bayesian 
inference, it can also be perceived as an inference technique per se and hence analysed in its 
own right. Connections with indirect inference have already been drawn, however the fine 
asymptotic analysis of ABC would be most useful to derive. Moreover, it could indirectly 
provide indications about the optimal calibration of the algorithm. 

• in connection with the above, the connection of ABC-based inference with other approxi- 
mative methods like variational Bayes inference is so far unexplored. Comparing and inter- 
breeding those different methods should become a research focus as well. 

• the construction and selection of the summary statistics is so far highly empirical. An auto- 
mated approach based on the principles of data analysis and approximate sufficiency would 
be much more attractive and convincing, especially in non-standard and complex settings. 

• the debate about ABC-based model choice is so far inconclusive in that we cannot guaran- 
tee the vahdity of the approximation, while considering that a "large enough" collection of 
summary statistics provides an acceptable level of approximation. Evaluating the discrep- 
ancy by exploratory methods like the bootstrap would shed a much more satisfactory light 
on this issue. 

• the method necessarily faces limitations imposed by large datasets or complex models, in 

that simulating pseudo-data may itself become an impossible task. Dimension-reducing 
technique that would simulate directly the summary statistics will quickly become neces- 
sary. 
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